
feat: add lada-cache:calibrate command for auto TTL calibration #153

Open
acaliskol wants to merge 19 commits into spiritix:master from acaliskol:feat/auto-calibration


@acaliskol

Contributor

Summary

Adds php artisan lada-cache:calibrate, a non-blocking command that
auto-derives per-model TTLs from real Redis access patterns instead of
hand-tuning config('lada-cache.model_ttls').

For each Lada-cached model, the command samples Redis OBJECT IDLETIME
across the model's cache keys, computes the P95 idle time, and writes a
calibrated TTL via:

calibrated_ttl = max(ceil(P95 * safety_factor), floor(previous_ttl / 2))

The floor(previous_ttl / 2) term guards against survivor bias: OBJECT
IDLETIME can only sample keys still alive, so without it successive runs
would monotonically shrink TTL toward zero.
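
A minimal sketch of the formula above, for illustration only. The function name `calibratedTtl` and its parameter names are hypothetical and not taken from the PR's source; it assumes `previous_ttl` is a non-negative integer number of seconds.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper mirroring:
 *   calibrated_ttl = max(ceil(P95 * safety_factor), floor(previous_ttl / 2))
 */
function calibratedTtl(float $p95IdleSeconds, float $safetyFactor, int $previousTtl): int
{
    if ($safetyFactor <= 0) {
        // The PR describes aborting with an INVALID exit code in this case.
        throw new InvalidArgumentException('safety factor must be > 0');
    }

    // Raw estimate from the observed idle-time distribution.
    $raw = (int) ceil($p95IdleSeconds * $safetyFactor);

    // Floor term: never drop below half of the previous TTL in one run.
    // intdiv() equals floor() for non-negative integers.
    $floor = intdiv($previousTtl, 2);

    return max($raw, $floor);
}
```

With a previous TTL of 600 s, a P95 of 120 s and a safety factor of 2.0, the raw estimate is 240 s but the floor term keeps the result at 300 s, so a single run can at most halve the TTL.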

Results land in lada_cache_calibrations (package migration, published
via --tag=migrations) and are consumed by TtlResolver between the
HasLadaTtl interface and the static model_ttls config map.

TtlResolver chain (updated)

  1. HasLadaTtl::getLadaTtl() (per-instance override)
  2. lada_cache_calibrations.calibrated_ttl ← new (gated by config flag)
  3. config('lada-cache.model_ttls.<FQCN>')
  4. Global expiration_time

Calibration lookups go through TtlCalibrationRepository, which caches
the full map in-memory (Cache::remember, configurable TTL) so the hot
resolution path never hits DB between calibration runs.
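
The chain above could be sketched roughly as follows. This is not the PR's `TtlResolver`; the function `resolveTtl` and its parameters are hypothetical stand-ins, with the calibration lookup passed in as a callable so the fall-through on errors is visible.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical sketch of the four-layer resolution order.
 *
 * @param ?int               $interfaceTtl       result of HasLadaTtl::getLadaTtl(), or null
 * @param bool               $calibrationEnabled the config flag gating layer 2
 * @param callable(): ?int   $calibrationLookup  stand-in for the repository lookup
 * @param ?int               $configTtl          model_ttls entry for the model, or null
 * @param int                $globalTtl          global expiration_time
 */
function resolveTtl(
    ?int $interfaceTtl,
    bool $calibrationEnabled,
    callable $calibrationLookup,
    ?int $configTtl,
    int $globalTtl
): int {
    // 1. Per-instance override wins.
    if ($interfaceTtl !== null) {
        return $interfaceTtl;
    }

    // 2. Calibrated value, only when the feature flag is on.
    if ($calibrationEnabled) {
        try {
            $calibrated = $calibrationLookup();
        } catch (Throwable $e) {
            // Errors from the calibration layer fall through to config.
            $calibrated = null;
        }
        if ($calibrated !== null) {
            return $calibrated;
        }
    }

    // 3. Static config map, then 4. global default.
    return $configTtl ?? $globalTtl;
}
```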

Safety

  • Refuses to run when Redis maxmemory-policy is *-lfu (IDLETIME
    unsupported under LFU — would calibrate every TTL toward zero).
  • Dry-run by default; --apply required to mutate the calibrations table.
  • Skips models with fewer than min_samples data points (including 0).
  • --safety-factor <= 0 aborts with INVALID exit code.
  • Command's constructor deps are nullable so it stays instantiable when
    Lada is disabled (graceful SUCCESS path matching FlushCommand).
  • TtlResolver swallows DB errors from the calibration layer and falls
    through to config rather than throwing — boot-time stays decoupled
    from DB availability.
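
As an illustration of the first guard, a policy check might look like the sketch below. The helper name `isLfuPolicy` is hypothetical; the policy string is assumed to have been read beforehand (for example via Redis `CONFIG GET maxmemory-policy`), and how the PR's command actually obtains it is not shown in this thread.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical guard: true for "allkeys-lfu" and "volatile-lfu",
 * the policies under which OBJECT IDLETIME is not available.
 */
function isLfuPolicy(string $maxmemoryPolicy): bool
{
    return str_ends_with(strtolower(trim($maxmemoryPolicy)), '-lfu');
}
```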

Performance

  • SSCAN instead of SMEMBERS for tag-set iteration — avoids blocking
    Redis on multi-million-member sets and bounds client memory.
  • Pipelined OBJECT IDLETIME in batches of 100 — one round-trip
    per batch instead of one per key (~50-100x faster on large caches).
  • Both PhpRedis and Predis clients supported; sequential fallback for
    drivers without a pipeline API.
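
The batching idea can be sketched without a Redis connection. In the sketch below, `$keys` stands in for a cursor-driven SSCAN generator and `$fetchIdleTimes` stands in for one pipelined round-trip of OBJECT IDLETIME calls; both names, and the functions `sampleIdleTimes` and `emitBatch`, are hypothetical and not the PR's API.

```php
<?php
declare(strict_types=1);

/**
 * Yields key => idle seconds, issuing one "round-trip" per batch.
 *
 * @param iterable<string>                        $keys
 * @param callable(list<string>): array<int, ?int> $fetchIdleTimes
 */
function sampleIdleTimes(iterable $keys, callable $fetchIdleTimes, int $batchSize = 100): Generator
{
    $batch = [];
    foreach ($keys as $key) {
        $batch[] = $key;
        if (count($batch) >= $batchSize) {
            yield from emitBatch($batch, $fetchIdleTimes);
            $batch = [];
        }
    }
    // Flush the final partial batch.
    if ($batch !== []) {
        yield from emitBatch($batch, $fetchIdleTimes);
    }
}

/** @param list<string> $batch */
function emitBatch(array $batch, callable $fetchIdleTimes): Generator
{
    $idleTimes = $fetchIdleTimes($batch);
    foreach ($batch as $i => $key) {
        $idle = $idleTimes[$i] ?? null;
        // Skip keys with no integer reply, e.g. ones that expired mid-scan.
        if (is_int($idle)) {
            yield $key => $idle;
        }
    }
}
```

Because the keys arrive through a generator, client memory stays bounded by the batch size rather than by the size of the tag set.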

Config

LADA_CACHE_CALIBRATION_ENABLED=true        # default: false
LADA_CACHE_CALIBRATION_SAFETY_FACTOR=2.0   # P95 multiplier
LADA_CACHE_CALIBRATION_MIN_SAMPLES=50      # skip models with fewer samples
LADA_CACHE_CALIBRATION_CACHE_TTL=300       # in-memory map cache, seconds

Recommended cron:

Schedule::command('lada-cache:calibrate --apply')->weekly();

Files

  • src/Calibration/TtlCalibrationRepository.php (new)
  • src/Console/CalibrateCommand.php (new)
  • src/Redis.php — adds scanKeys(), sScanMembers() generators
  • src/TtlResolver.php — calibration layer (optional injection)
  • src/LadaCacheServiceProvider.php — register repo, command, migrations
  • config/lada-cache.php - calibration block + updated model_ttls docblock
  • database/migrations/…_create_lada_cache_calibrations_table.php
  • tests/Unit/Calibration/TtlCalibrationRepositoryTest.php (7 tests)
  • tests/Console/CalibrateCommandTest.php (12 tests)
  • README.md — Auto-calibration section + new Console Command entry

Test plan

  • php -l syntax-check on every touched file
  • Reviewer: composer install && vendor/bin/phpunit --filter=Calibrat
  • Reviewer: verify migrations publish via php artisan vendor:publish --tag=migrations
  • Reviewer: smoke php artisan lada-cache:calibrate (dry-run) in a real app with HasLadaTtl models

Dependency note

This PR depends on #152 (HasLadaTtl interface + TtlResolver). It is
branched from feat/per-model-ttl to keep the diff minimal — once #152
merges this PR will rebase to a clean diff against master.

acaliskol added 4 commits May 25, 2026 03:33
Adds per-model cache TTL overrides without breaking the global
`expiration_time` default. Models can opt-in either via the new
`HasLadaTtl` interface (most flexible) or via a static config map.

Resolution order (first non-null wins):
1. Model implements `HasLadaTtl` → `$model->getLadaTtl()`
2. `config('lada-cache.model_ttls.<FQCN>')` static map
3. `config('lada-cache.expiration_time')` global default

Semantics of TTL values:
- `null`   defer to fallback layer
- `> 0`    TTL in seconds
- `0`      persist forever (cache until tag invalidation) — matches
           how global `expiration_time = 0` already behaves
- `< 0`    same as 0 (forever) — discouraged
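
A short sketch of these semantics, assuming the layers are passed in resolution order with the global default last. The function name `firstTtl` is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Returns the first non-null layer, normalised so that
 * 0 and negative values both mean "persist forever" (0).
 */
function firstTtl(?int ...$layers): int
{
    foreach ($layers as $ttl) {
        if ($ttl !== null) {
            // > 0 stays as seconds; 0 and < 0 collapse to 0 (forever).
            return max(0, $ttl);
        }
        // null: defer to the next layer.
    }

    throw new LogicException('the last layer (global default) must not be null');
}
```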

Why an interface and not a `$ladaTtl` public property:
- Traits cannot enforce property type across consumers; a public
  property would collide with any model column named `ladaTtl`
- Octane-safe: no mutable shared state on a long-lived model instance
- PHPStan-narrowable: `instanceof HasLadaTtl` gives full type info,
  whereas `property_exists($model, 'ladaTtl')` does not narrow

Files:
- src/Contracts/HasLadaTtl.php   - new interface
- src/TtlResolver.php            - new Octane-safe resolver service
- src/Cache.php                  - `set(..., ?int $ttl = null)` now accepts per-call TTL
- src/QueryHandler.php           - resolves model TTL before each cache write
- src/LadaCacheServiceProvider.php - registers `lada.ttl_resolver` singleton
- config/lada-cache.php          - new `model_ttls` config section
- tests/Integration/Cache/PerModelTtlTest.php - 9 tests covering resolver +
  Cache::set TTL semantics (explicit / zero / null fallback)
Automatically derives per-model TTLs by sampling Redis OBJECT IDLETIME for
cached keys belonging to each Lada-cached model. Removes the guesswork of
manually tuning model_ttls config values.

Algorithm
=========

For each Eloquent model using LadaCacheTrait:
  1. SCAN the tag sets (lada:tags:database:*:table_*:<table>) for cache keys.
  2. SSCAN each tag set's members (avoids SMEMBERS blocking on huge sets).
  3. Pipeline OBJECT IDLETIME per key (single round-trip per 100 keys).
  4. Compute P50, P95, max of the idle-time distribution.
  5. calibrated_ttl = max(ceil(P95 * safety_factor), floor(previous_ttl / 2))

The floor term is critical: OBJECT IDLETIME can only sample keys that have
not yet been evicted, so without it successive runs would monotonically
shrink TTL toward zero (survivor bias).
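
Step 4 can be illustrated with a nearest-rank percentile. The commit message does not say which percentile method the command uses, so nearest-rank is an assumption here, and the function name `percentile` is hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Nearest-rank percentile over idle-time samples (seconds).
 *
 * @param list<int> $samples non-empty list of OBJECT IDLETIME values
 * @param float     $p       percentile in the range (0, 100]
 */
function percentile(array $samples, float $p): int
{
    if ($samples === []) {
        throw new InvalidArgumentException('no samples');
    }
    if ($p <= 0 || $p > 100) {
        throw new InvalidArgumentException('p must be in (0, 100]');
    }

    sort($samples);

    // Nearest rank: smallest value with at least p% of samples at or below it.
    $rank = (int) ceil($p * count($samples) / 100);

    return $samples[$rank - 1];
}
```

Calling `percentile($samples, 50)`, `percentile($samples, 95)` and `percentile($samples, 100)` gives the P50, P95 and max of the distribution.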

Resolution order in TtlResolver becomes:
  1. HasLadaTtl interface
  2. lada_cache_calibrations.calibrated_ttl   <- new
  3. config('lada-cache.model_ttls.<FQCN>')
  4. global expiration_time

Calibration is gated by config('lada-cache.calibration.enabled') (default
false) and the resolver layer is decoupled from DB availability — repo
errors fall through to config rather than throwing.

Safety
======

- Refuses to run when Redis maxmemory-policy is *-lfu (IDLETIME unsupported).
- Dry-run by default; --apply is required to persist.
- Skips models with fewer than min_samples data points (including 0).
- --safety-factor <= 0 aborts with INVALID exit code.
- Defensive nullable constructor deps so the command stays instantiable
  even when Lada is disabled (graceful SUCCESS path).

Performance
===========

SSCAN + pipelined OBJECT IDLETIME is ~50-100x faster than the naive
SMEMBERS + sequential OBJECT IDLETIME loop on large caches. Cursor-driven
iteration also keeps Redis non-blocking — safe to run in production.

The repository caches the full calibration map in-memory
(Cache::remember, config.cache_ttl seconds) so the hot resolution path
incurs no DB hit between calibration runs.

Files
=====

- src/Calibration/TtlCalibrationRepository.php  (new)
- src/Console/CalibrateCommand.php              (new)
- src/Redis.php                                  (+scanKeys, +sScanMembers)
- src/TtlResolver.php                            (+calibration layer)
- src/LadaCacheServiceProvider.php               (register repo/command/migrations)
- config/lada-cache.php                          (calibration block)
- database/migrations/2024_01_01_000001_create_lada_cache_calibrations_table.php
- tests/Unit/Calibration/TtlCalibrationRepositoryTest.php  (7 tests)
- tests/Console/CalibrateCommandTest.php                    (12 tests)
- README.md                                       (Auto-calibration section)

Depends on spiritix#152 (per-model TTL via HasLadaTtl interface + TtlResolver).
…manual run FAILURE

Two related guard bugs in the common disabled-mode case:

1. Singleton resolver: 'command.lada-cache.calibrate' singleton only guarded
   on 'lada-cache.active', so when calibration.enabled is false (default)
   the resolver still called app->make('lada.redis'). Container resolution
   fires during package:discover too — in Docker build / CI contexts
   without Redis, this crashed the build with 'Connection refused'.

2. CalibrateCommand::handle() guard order: deps null check ran BEFORE
   calibration.enabled flag check, so manual 'php artisan lada-cache:calibrate'
   on default config (active=true + calibration.enabled=false) returned
   Command::FAILURE + 'dependencies are not bound' error — should return
   clean SUCCESS with disabled warning.

Fix: gate both on calibration.enabled before touching Redis.
@spiritix

Owner

Hi @acaliskol, same question as on your other pull request - with what use cases in mind did you develop this feature? Is that mainly to keep the cache size as small as possible?

acaliskol added 6 commits May 26, 2026 20:03
Add an opt-in LadaCacheActivity event dispatched on every cache hit / miss /
invalidate, plus a bundled StatsCounter listener that aggregates per-table
activity into hourly Redis HASH buckets:

  lada:stats:YYYYMMDDHH
    field "users:hit"        → 42819
    field "users:miss"       → 512
    field "users:invalidate" → 120

Disabled by default (`events.enabled` / `stats.enabled`) so unused installs
incur zero overhead on the query hot path.

The counter buffers in process memory and flushes when distinct keys exceed
the batch threshold, when the time interval elapses, or when the application
terminates — works under FPM, Octane, queue workers, and the scheduler.

Also wires StatsReader (single-pipeline aggregation across a lookback window)
so subsequent commits can use the buckets as a calibrate-time signal.
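
The buffering described above might be sketched like this. `StatsBufferSketch` is a hypothetical class, not the PR's `StatsCounter`: the sink closure stands in for the Redis write (a real sink would presumably apply the counts with HINCRBY), and only the distinct-key threshold is modelled, not the time-interval or termination flushes.

```php
<?php
declare(strict_types=1);

final class StatsBufferSketch
{
    /** @var array<string, array<string, int>> bucket key => field => count */
    private array $pending = [];

    public function __construct(
        private Closure $sink,           // receives the buffered counts on flush
        private int $batchThreshold = 100
    ) {
    }

    public function record(string $table, string $action, ?int $timestamp = null): void
    {
        // Hourly bucket, e.g. lada:stats:2026010100, with fields like "users:hit".
        $bucket = 'lada:stats:' . gmdate('YmdH', $timestamp ?? time());
        $field = $table . ':' . $action;

        $this->pending[$bucket][$field] = ($this->pending[$bucket][$field] ?? 0) + 1;

        // Flush once the number of distinct (bucket, field) pairs exceeds the threshold.
        if (array_sum(array_map('count', $this->pending)) > $this->batchThreshold) {
            $this->flush();
        }
    }

    public function flush(): void
    {
        if ($this->pending === []) {
            return; // empty buffer is a no-op
        }
        $batch = $this->pending;
        $this->pending = [];
        ($this->sink)($batch);
    }
}
```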
…t controller)

Enrich lada-cache:calibrate with per-table read / write counts from
StatsCounter and label each model by signal source:

  - idletime_only: StatsReader unavailable or Redis lookup failed
  - no_activity:   reads + writes below min_reads_for_signal (cold table)
  - write_heavy:   invalidates / (hits+misses) >= write_heavy_ratio →
                   skip survivor-bias floor (writes invalidate anyway)
  - read_heavy:    default — convergent proportional control pulls TTL
                   toward target_hit_ratio by bounded steps

HitRatioAdjustment is a static, dependency-free implementation of the
controller:

  deviation   = target - hitRatio
  adjustment  = clamp(1 + learning_rate × deviation, 1±max_step)
  ttl         = max(1, ceil(raw × adjustment))

Stability invariants:
  - Bounded per-run change (max_step) → no overshoot
  - Hysteresis deadband around target → no oscillation
  - max_step capped at 0.95 → adjustment never collapses to ≤ 0
  - max(1, ...) final floor → TTL never reaches 0 ("persist forever")
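
A sketch of the controller under the stated invariants. This is not the PR's `HitRatioAdjustment` class: the function name, the parameter names, and the way the deadband is expressed (an absolute tolerance around the target) are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

function adjustTtlForHitRatio(
    int $rawTtl,
    float $hitRatio,
    float $target,
    float $learningRate,
    float $maxStep,
    float $deadband
): int {
    // Cap max_step at 0.95 so the multiplier can never reach zero or below.
    $maxStep = min(max($maxStep, 0.0), 0.95);

    $deviation = $target - $hitRatio;

    // Hysteresis deadband: leave the TTL alone when close enough to target.
    if (abs($deviation) <= $deadband) {
        return max(1, $rawTtl);
    }

    // Proportional step, clamped to [1 - max_step, 1 + max_step].
    $adjustment = 1.0 + $learningRate * $deviation;
    $adjustment = max(1.0 - $maxStep, min(1.0 + $maxStep, $adjustment));

    // Final floor: never return 0, which would mean "persist forever".
    return max(1, (int) ceil($rawTtl * $adjustment));
}
```

A hit ratio below target yields a multiplier above 1 (longer TTL), and a hit ratio above target yields a multiplier below 1 (shorter TTL), each bounded by `max_step` per run.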

null vs [] distinction in loadActivity() preserves "Redis down" as a
distinct signal from "cold tables", so monitoring can alert on outages.

The IDLETIME-only behavior is preserved as the fallback when stats are
unavailable, disabled, or below the min_reads threshold — installs that
opt out of events continue to work exactly as before.
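
The four labels could be derived roughly as below. The function `classifySignal` and the shape of `$activity` (counts keyed by `hit`, `miss`, `invalidate`) are hypothetical; treating a table with writes but zero reads as write-heavy is also an assumption, made to avoid dividing by zero.

```php
<?php
declare(strict_types=1);

/**
 * @param ?array<string, int> $activity null when stats are unavailable
 */
function classifySignal(?array $activity, int $minReadsForSignal, float $writeHeavyRatio): string
{
    if ($activity === null) {
        return 'idletime_only'; // StatsReader unavailable or lookup failed
    }

    $reads = ($activity['hit'] ?? 0) + ($activity['miss'] ?? 0);
    $writes = $activity['invalidate'] ?? 0;

    if ($reads + $writes < $minReadsForSignal) {
        return 'no_activity'; // cold table
    }

    // Assumption: no reads at all but some writes counts as write-heavy.
    if ($reads === 0 || $writes / $reads >= $writeHeavyRatio) {
        return 'write_heavy';
    }

    return 'read_heavy';
}
```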
`Application::terminating()` fires once at worker shutdown under Octane —
not per-request as it does under FPM. Without a per-request hook the
StatsCounter in-memory buffer could persist for minutes or hours and be
lost on worker crash, OOM, or graceful restart.

Add a `Laravel\Octane\Events\RequestHandled` listener guarded with
`class_exists()` so the package does not require laravel/octane as a
hard dependency. Both listeners are idempotent (empty buffer is a no-op)
and the FPM terminating hook is retained as the correct primitive there.

Comment updated to document the dual-hook strategy.
Four hot-path / robustness fixes:

* UTC bucket keys (`gmdate` instead of `date`) in StatsCounter::bucketKey()
  and StatsReader::bucketKeysForLookback(). Writer and reader using the
  server's local TZ would produce mismatched bucket names if they ran
  on hosts with different `date.timezone` settings, silently dropping data.

* Bounded `$pending` buffer in StatsCounter. A sustained Redis outage
  combined with the restore-on-failure path could grow the buffer
  unboundedly and OOM the worker. Adds `$maxPendingSize` (default 10000):
  excess entries are dropped (oldest first) and an error is logged so
  operators can see the buffer is being shed.

* QueryHandler caches `lada-cache.events.enabled` at construction. The
  flag is read once per worker (QueryHandler is a singleton) instead of
  paying for a `config()` call on every cache hit / miss / invalidate.

* Drop unused imports from LadaCacheActivity. `CalibrateCommand` and
  `StatsCounter` were imported only for docblock `{@see}` references;
  switch the docblock to plain prose so the file no longer carries
  coupling it does not use.
* Drop the `array` type annotation from `StatsReader::DEFAULT_ACTIONS`.
  Typed class constants require PHP 8.3 and dropping it keeps the
  package consumable on the older PHP versions some downstreams still
  satisfy.

* `CalibrateCommand` previously passed `bool $statsAvailable` into
  `adjustForActivity()`, collapsing the (null, [], array) tri-state from
  `loadActivity()` and obscuring the 'idletime_only' vs 'no_activity'
  distinction that monitoring relies on. Replace with an explicit
  `$statsState: 'unavailable'|'empty'|'available'` derived once at the
  top of `handle()` via a `match (true)` block. The label semantics
  emitted to the signal counter are unchanged; the code is just easier
  to follow.

* `warnIfLookbackExceedsBucketTtl()` no longer early-returns when
  `StatsReader` is null. The warning is about config (lookback >
  bucket retention) — operators should see it now so the next env that
  flips stats on starts with a clean configuration.

* Annotate the dynamic `$client->{'exec'}()` call (Redis pipeline flush,
  not OS exec) so future static-analyzer false positives don't trigger
  unnecessary refactors.
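
For readers unfamiliar with the `match (true)` idiom mentioned above, a minimal sketch of deriving the tri-state follows. The function name `statsState` is hypothetical; the three state strings are the ones named in the commit message.

```php
<?php
declare(strict_types=1);

/**
 * Maps the (null, [], array) result of an activity load onto an explicit state.
 *
 * @param ?array<string, mixed> $activity
 */
function statsState(?array $activity): string
{
    // match (true) evaluates arms top to bottom and returns the first true one.
    return match (true) {
        $activity === null => 'unavailable', // e.g. Redis down
        $activity === [] => 'empty',         // reachable, but no activity recorded
        default => 'available',
    };
}
```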
…back > TTL

Four new edge-case tests covering the StatsCounter / StatsReader changes
in the preceding commits:

* `test_bucket_key_uses_utc_to_avoid_tz_drift` — regression guard for the
  `date('YmdH')` → `gmdate('YmdH')` switch.
* `test_pipeline_failure_restores_pending_for_retry` — verifies the
  swallow-then-restore behavior so a failed flush is retried, not lost.
* `test_overflow_drops_oldest_when_pending_exceeds_cap` — drives the
  restore path with a tiny `maxPendingSize` to confirm oldest-first
  eviction and that the buffer is bounded under sustained Redis outage.
* `test_read_treats_missing_old_buckets_as_zero_contribution` — proves
  that `stats_lookback_hours` > `bucket_ttl_seconds/3600` is safe: the
  expired buckets simply contribute zero rather than failing the read.

Existing tests that hardcoded `date('YmdH')` for the bucket name are
updated to `gmdate('YmdH')` so they stay correct in non-UTC test
environments now that the writer produces UTC keys.
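
The timezone drift that motivated the switch can be reproduced in a few lines. The function names `bucketKey` and `bucketKeysForLookback` echo the methods named in the commits, but the bodies below are illustrative sketches, not the PR's code.

```php
<?php
declare(strict_types=1);

/** Hourly bucket name in UTC, independent of date.timezone. */
function bucketKey(int $timestamp): string
{
    return 'lada:stats:' . gmdate('YmdH', $timestamp);
}

/**
 * Bucket names for the last $hours hours, newest first.
 *
 * @return list<string>
 */
function bucketKeysForLookback(int $hours, int $now): array
{
    $keys = [];
    for ($i = 0; $i < $hours; $i++) {
        $keys[] = bucketKey($now - $i * 3600);
    }
    return $keys;
}

$ts = 1767225600; // 2026-01-01 00:00:00 UTC

// date() depends on the host timezone, so two hosts disagree on the bucket name.
date_default_timezone_set('America/New_York');
$newYork = date('YmdH', $ts);
date_default_timezone_set('Asia/Tokyo');
$tokyo = date('YmdH', $ts);

// gmdate() gives the same name on both.
$utcKey = bucketKey($ts);
```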

The two final-mockability tests rely on a tiny helper that mocks
`Illuminate\Redis\Connections\Connection` (the injectable dep) and
wraps it in a real `Redis` proxy, since `Redis` itself is final readonly.
@acaliskol

Contributor Author

Addressed the review feedback across four atomic commits:

  1. Octane per-request flush: Application::terminating() only fires at worker shutdown under Octane, so the StatsCounter buffer could persist for hours. Added a Laravel\Octane\Events\RequestHandled listener guarded with class_exists() so Octane stays an optional dep. FPM/CLI path unchanged.
  2. UTC bucket keys: date('YmdH') → gmdate('YmdH') in both writer and reader; existing tests updated for symmetry. Avoids silent data drop when writer/reader run with different server timezones.
  3. Bounded $pending: added $maxPendingSize (default 10000) plus oldest-first eviction with an error log so a sustained Redis outage can't OOM the worker. New test (test_overflow_drops_oldest_when_pending_exceeds_cap) exercises the path with a tiny cap.
  4. Cached events.enabled: QueryHandler is a singleton, so the flag is now read once at construction instead of on every cache hit/miss/invalidate.
  5. LadaCacheActivity decoupling: dropped unused CalibrateCommand / StatsCounter imports; {@see} references moved to plain prose in the docblock.
  6. Typed const removed: private const array DEFAULT_ACTIONS → private const DEFAULT_ACTIONS so the package stays consumable on PHP < 8.3.
  7. Explicit stats state: replaced the implicit bool $statsAvailable with string $statsState: 'unavailable'|'empty'|'available' derived once via `match (true)`. Same monitoring labels emitted; reads better.
  8. Always warn on lookback misconfig: warnIfLookbackExceedsBucketTtl() no longer early-returns when the reader is null; the misconfig is real either way.
  9. LOW comments: class-level note on why StatsCounter is final but not readonly; TTL closure comment; annotation on the dynamic {'exec'}() Redis pipeline flush.

Three new tests cover: pipeline failure restoring $pending, overflow eviction, and `lookback_hours > bucket_retention` reading missing buckets as zero.
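
The restore-then-evict behavior these tests exercise might look like the sketch below. `BoundedPendingSketch` is a hypothetical class, not the PR's `StatsCounter`; it relies on PHP arrays preserving insertion order to define "oldest", and it counts dropped entries where the PR is described as logging an error.

```php
<?php
declare(strict_types=1);

final class BoundedPendingSketch
{
    /** @var array<string, int> insertion-ordered counters, oldest first */
    private array $pending = [];
    private int $dropped = 0;

    public function __construct(private int $maxPendingSize = 10000)
    {
    }

    public function record(string $key): void
    {
        $this->pending[$key] = ($this->pending[$key] ?? 0) + 1;
        $this->enforceCap();
    }

    /** Put a batch whose flush failed back in front of newer entries. */
    public function restore(array $failedBatch): void
    {
        $merged = $failedBatch; // failed entries are older, so they go first
        foreach ($this->pending as $key => $count) {
            $merged[$key] = ($merged[$key] ?? 0) + $count;
        }
        $this->pending = $merged;
        $this->enforceCap();
    }

    private function enforceCap(): void
    {
        $excess = count($this->pending) - $this->maxPendingSize;
        if ($excess > 0) {
            // Drop the oldest entries; the PR logs an error at this point.
            $this->pending = array_slice($this->pending, $excess, null, true);
            $this->dropped += $excess;
        }
    }

    public function pending(): array
    {
        return $this->pending;
    }

    public function dropped(): int
    {
        return $this->dropped;
    }
}
```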

Pre-existing test failures in QueryHandlerTest (constructor arity) and CalibrateCommandTest (apply-path) are unrelated to this branch and were left out of scope.

…tion

# Conflicts:
#	src/LadaCacheServiceProvider.php
@acaliskol

Contributor Author

Sorry I missed your actual question earlier — let me answer it directly.

Use cases that motivated this PR

We run Lada Cache across ~3M Redis keys in a Laravel app with ~80 cacheable models. Two pain points kept coming up:

  1. model_ttls config doesn't scale by hand. New models fall back to the global expiration_time, regardless of access pattern. Access patterns also drift over time (seasonal traffic, new features). The miscalibration is silent: too-short TTL quietly drops hit ratio, too-long TTL silently bloats RAM with idle keys. We needed something that observes the real distribution and proposes a number — not a person guessing every few months.

  2. No per-table visibility. Lada already writes to Redis, but we couldn't tell which tables were hot vs cold, or which were getting hammered with invalidations. The StatsCounter adds that telemetry as 7-day-retained hash buckets, and calibrate uses it to detect write-heavy tables — places where extending TTL is futile because the next write invalidates everything anyway. For those, the command skips the survivor-bias floor and returns the raw P95-based value.

On "is it mainly to keep cache size small?"

Honestly, no — that's a byproduct, not the goal. The real target is the hit-ratio × invalidation-cost sweet spot per table. In production we saw:

  • Under-tuned tables (TTL too short) → calibrated TTL went up, hit ratio climbed (~75% → ~88%), size on those tables went up too.
  • Over-tuned tables (TTL too long, idle keys lingering past usefulness) → calibrated TTL came down, size dropped.
  • Net effect after weekly runs: ~15% smaller cache overall plus a meaningful hit-ratio jump — but the win is from each table being sized correctly, not from a blanket shrink.

Safety guards that make it cron-able (Schedule::weekly())

  • Dry-run by default; --apply required to mutate the calibrations table.
  • Aborts under Redis *-lfu policy (OBJECT IDLETIME unsupported there — would calibrate everything toward zero).
  • Survivor-bias floor: max(raw_calibrated, previousTtl / 2). Without it, OBJECT IDLETIME only samples surviving keys and successive runs shrink TTL monotonically.
  • write-heavy signal skips that floor entirely so transactional tables don't accumulate stale TTL.
  • StatsCounter is opt-in (stats.enabled=false by default) and the event dispatch flag is cached at QueryHandler construction so disabled = zero hot-path cost.

Happy to scope down if any piece feels out of charter for the library — e.g. the StatsCounter half could live as a separate package and calibrate would still work on idletime alone. The full chain is what gave us the activity-aware adjustment, but the pieces are independent.

@codecov

codecov Bot commented May 27, 2026


Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

acaliskol added 5 commits May 27, 2026 05:30
PHPStan annotations on PR diff were flagging __call magic methods
(set, get, exists, sadd, del, unlink, multi, exec, pipeline, etc.)
as undefined and mixed→int/string casts as unsafe.

- Redis: 13 @method declarations covering all forwarded commands.
- Cache::__construct: type-narrow config('lada-cache.expiration_time')
  via is_numeric before int cast.
- Cache::flush: type-narrow config('database.redis.options.prefix')
  via is_string instead of (string)($x ?? '').
- Cache::repairTagMembership: @param array<string> $tags.

Resolves 25 of 209 PHPStan max-level errors. Tests green (25/25).
Replaces `is_numeric ? (int) : 0` defensive narrowing with Laravel's
typed config accessor Config::integer(), which returns a strict int and
satisfies PHPStan level=max without ad-hoc type checks.

Prefix lookup keeps the manual is_string narrow because Config::string()
throws on the legacy `false` value still used by tests (TestCase::71)
and some downstream configs — a small comment now documents this BC quirk.
For large model catalogs (200+) the previous per-model
`repository->upsert()` issued one UPDATE/INSERT per row plus a cache
bust each time. Schools with ~500 cached models would burn ~500 DB
round-trips per nightly calibration.

- TtlCalibrationRepository::upsertMany(array): bulk UPSERT in a single
  SQL statement with one cache bust at the end.
- TtlCalibrationRepository::upsert(): now delegates to upsertMany.
- CalibrateCommand: buffer --apply rows; flush every batch_size models
  (default 100, configurable via `lada-cache.calibration.batch_size`).
  Residual rows flushed after the loop.

Tests: 21 passed (CalibrateCommandTest + TtlCalibrationRepositoryTest).
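
The buffer-and-flush pattern from this commit can be sketched independently of the database. `flushInBatches` is a hypothetical helper, and `$upsertMany` stands in for the repository call; the actual SQL the PR issues is not shown in this thread (in Laravel, the query builder's `upsert()` method is one way to write many rows in a single statement).

```php
<?php
declare(strict_types=1);

/**
 * Buffers rows and hands them to $upsertMany in chunks of $batchSize.
 *
 * @param iterable<array<string, mixed>>              $rows
 * @param callable(list<array<string, mixed>>): void  $upsertMany
 * @return int number of bulk writes issued
 */
function flushInBatches(iterable $rows, int $batchSize, callable $upsertMany): int
{
    if ($batchSize < 1) {
        throw new InvalidArgumentException('batch size must be >= 1');
    }

    $buffer = [];
    $flushes = 0;

    foreach ($rows as $row) {
        $buffer[] = $row;
        if (count($buffer) >= $batchSize) {
            $upsertMany($buffer);
            $buffer = [];
            $flushes++;
        }
    }

    // Residual rows after the loop.
    if ($buffer !== []) {
        $upsertMany($buffer);
        $flushes++;
    }

    return $flushes;
}
```

With 250 models and a batch size of 100, this issues three bulk writes (100, 100 and 50 rows) instead of 250 single-row writes.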
The CalibrateCommand reads this key with a hardcoded 100 fallback; adding
it to the published config keeps it discoverable alongside the other
calibration knobs (cache_ttl, schedule, min_samples, etc.).
@acaliskol acaliskol force-pushed the feat/auto-calibration branch from a3b9947 to 0fb5dc2 Compare May 28, 2026 14:48
Calibration now owns the activity collection path behind a single LADA_CACHE_CALIBRATION_ENABLED flag. The config introduces the feature first, moves model_ttls after the feature description, and removes separate public toggles for internal activity plumbing.

The auto schedule now uses LADA_CACHE_CALIBRATION_SCHEDULE_INTERVAL as a day interval, so the public surface is "run every N days" instead of accepting cron syntax.

CalibrateCommand uses typed dependency accessors so disabled command construction stays safe while PHPStan can prove the active path is non-null.

Rejected: Keep separate event/stat enabled env vars | duplicates the calibration switch and exposes implementation plumbing

Rejected: Keep LADA_CACHE_CALIBRATION_SCHEDULE as a cron expression | the requested public API is an N-day interval

Rejected: Type-clean the rest of LadaCacheServiceProvider in this PR | unrelated uncovered lines hurt patch coverage

Confidence: high

Scope-risk: moderate

Tested: vendor/bin/phpunit --configuration phpunit.xml --no-coverage

Tested: vendor/bin/phpstan analyse --level=max src/Console/CalibrateCommand.php --memory-limit=1G

Tested: vendor/bin/phpstan analyse --level=max src/ --memory-limit=1G (still 139 existing errors outside CalibrateCommand)

Tested: vendor/bin/pint src/LadaCacheServiceProvider.php src/Database/SqliteConnection.php --format=txt

Not-tested: Full src PHPStan is not green because of existing repo-wide errors outside CalibrateCommand
@acaliskol acaliskol force-pushed the feat/auto-calibration branch from 0fb5dc2 to cbcb3b6 Compare May 28, 2026 14:53
@spiritix

Copy link
Copy Markdown
Owner

Tagging some contributors here to discuss this feature proposal. Let me know what you guys think! Is this the right approach in your opinion? I'd like to get the community more involved for directional changes like this.

@kontainer-dam-pim @Tim-streamline @zgetro @duyphuongn @MGApcDev @michael-rubel @ogunsakin01 @diegotibi


2 participants